STEERING POLICIES FOR MARKOV DECISION PROCESSES
UNDER A RECURRENCE CONDITION

by 

Dye-Jyun  Ma1  and  Armand  M.  Makowski2 

Electrical  Engineering  Department  and  Systems  Research  Center 
University  of  Maryland,  College  Park,  Maryland  20742 

ABSTRACT 

This  paper  presents  a  class  of  adaptive  policies  in  the  context  of  Markov  decision  processes 
(MDP’s)  with  long-run  average  performance  measures.  Under  a  recurrence  condition,  the  proposed 
policy  alternates  between  two  stationary  policies  so  as  to  adaptively  track  a  sample  average  cost  to 
a  desired  value.  Direct  sample  path  arguments  are  presented  for  investigating  the  convergence  of 
sample  average  costs  and  the  performance  of  the  adaptive  policy  is  discussed.  The  obtained  results 
are  particularly  useful  in  discussing  constrained  MDP’s  with  a  single  constraint.  Applications 
include  a  wide  class  of  constrained  MDP’s  with  finite  state  space  (Beutler  and  Ross  1985),  an 
optimal  flow  control  problem  (Ma  and  Makowski  1987)  and  an  optimal  resource  allocation  problem 
(Nain and Ross 1986).

Large classes of engineering problems can be cast as Markov decision processes (MDP’s) with long-run average performance measures, and in many situations the analysis identifies a (possibly randomized) stationary policy $f^*$ that yields the desired performance. Unfortunately, the structural properties of the policy $f^*$ often prevent its implementation, owing either to computational difficulties inherent in its definition (Nain and Ross 1986) or to insufficient knowledge of the model parameters (Ma 1988). In fact, in many applications (Ma, Makowski and Shwartz 1986), solving explicitly for $f^*$ turns out to be a difficult task, which is further compounded when some of the model parameters are not exactly known.

Such difficulties naturally point to the need for an implementation theory within the context of MDP’s. The purpose of this theory is to develop implementable strategies which yield the same performance as the policy $f^*$. Here, implementability is synonymous with the availability of an algorithm which produces on-line control values, given available feedback and model information. Such implementation issues were recently discussed in (Makowski and Shwartz 1986a), where various methods for implementation were proposed.

In this paper, the discussion is given in the context of MDP’s with countable state space, under some recurrence structure, and attention is focused on a class of implementable policies called steering policies. More concretely, let $V$ be the (desirable) value of the long-run average cost incurred under the policy $f^*$, and let $\bar g$ and $\underline g$ denote two stationary policies (possibly randomized). The policy $\bar g$ (resp. $\underline g$) overshoots (resp. undershoots) the requisite performance level $V$ in that it yields a value for the long-run average cost which is higher (resp. lower) than $V$. The proposed scheme assumes only the implementability of the two stationary policies $\bar g$ and $\underline g$, and adaptively alternates between them under the assumption that some privileged state $z$ is visited infinitely often under both policies. The decision to switch policies is taken only at the times when the system visits the privileged state $z$, so as to adaptively track the sample average cost to the value $V$. At those (random) instants, the current value of the sample average cost is compared against the target value $V$. If the sample average is above (resp. below) the value $V$, the policy $\underline g$ (resp. $\bar g$) is used until the next visit to the privileged state. Thus, between two consecutive visits to that particular state, one and only one of the two policies is used.

This steering policy $\alpha$ is analyzed under the assumption that for both policies $\bar g$ and $\underline g$, the privileged state $z$ is recurrent for the induced Markov chain and that there is “no escape at infinity”. It is shown that the policy $\alpha$ indeed steers the sample cost averages to the desired value $V$, and under additional growth conditions, that the long-run expected averages under $f^*$ and $\alpha$ coincide. Direct sample path arguments are presented. They take advantage of the very form of the steering policy $\alpha$ and exploit some hidden regenerative properties of the state process under $\alpha$. This discussion is inspired by the proof of the Ergodic Theorem for recurrent Markov chains based on the Strong Law of Large Numbers as given by Chung (1967).

The obtained results are of interest in the context of constrained MDP’s with a single constraint, where an optimal stationary policy is often found by simple randomization between two pure stationary policies $\bar g$ and $\underline g$ with the abovementioned properties. Typically, these two pure policies are identified through Lagrangian arguments and the randomization bias is chosen so as to meet the constraint value (Beutler and Ross 1985, Ma and Makowski 1987, Nain and Ross 1986). The very form of this solution lends itself to an implementation via the steering policy $\alpha$, which requires no knowledge of the randomization bias value, and as such can be viewed as an indirect adaptive policy (Ma, Makowski and Shwartz 1986). The steering policy $\alpha$ considered here should also be contrasted with the so-called time-sharing implementation of $f^*$ proposed by Altman and Shwartz (1986), whereby the decision-maker alternates between the two policies $\bar g$ and $\underline g$ according to some deterministic (thus non-adaptive) mechanism associated with the recurrence cycles. The instrumentation of these time-sharing policies requires the explicit evaluation of certain cost functionals, so that the proposed steering policy $\alpha$ could be interpreted as providing an adaptive version of time sharing.

The work reported here was motivated by an idea proposed by Ross (1985) in the context of an optimal resource allocation problem with a constraint. Ross suggested a scheme whereby the decision-maker could possibly switch between two static priority assignments at any decision epoch so as to steer the long-run average cost to the value $V$. The analysis in that case seems more involved, and as of the writing of this paper, the question of its performance still remains open. However, in some specific situations, which include a class of constrained MDP’s with finite state spaces, the results obtained here translate into results for Ross’s scheme.

The paper is organized as follows. The underlying MDP formulation is stated in Section 1. The problem of steering the cost to a specific value is precisely formulated in Section 2.1, the steering policy $\alpha$ is introduced in Section 2.2 and the key technical assumptions are discussed in Section 2.3. The main results of the paper are presented in Section 3.1, and are proved in Section 3.3 using some key intermediate results which are summarized in Section 3.2. While the proof of these intermediate results is delayed until Section 5, Section 4 first outlines applications to constrained MDP’s. The situation of finite state spaces and compact action space is discussed in Section 4.1, while problems in optimal flow control and resource allocation are considered in Sections 4.2 and 4.3, respectively. Section 5 closes the paper with a detailed discussion of the sample path arguments.

A word on the notation: The set of real numbers is denoted by $\mathbb{R}$, and $\mathbb{N}$ denotes the set of all non-negative integers. The indicator function of any set $E$ is simply denoted by $1[E]$. Unless stated otherwise, $\lim_n$, $\liminf_n$ and $\limsup_n$ are taken with $n$ going to infinity.

1.  The  model 

Consider an MDP with countable state space $S$, measurable action space $U$, and Borel measurable transition kernel $(p_{xy}(u))$, i.e., the mappings $p_{xy}(\cdot) : U \to \mathbb{R}$ are Borel measurable and satisfy the standard properties $0 \le p_{xy}(u) \le 1$ and $\sum_{y \in S} p_{xy}(u) = 1$ for all $x$ and $y$ in $S$ and $u$ in $U$. The probabilistic framework for this MDP is defined on the canonical sample space $\Omega := (S \times U)^\infty$. An element $\omega$ of $\Omega$ is viewed as a sequence $(x_0, \omega_0, \omega_1, \ldots)$ with $x_0$ in $S$ and $\omega_n$ in $U \times S$ for all $n = 0, 1, \ldots$, where each block component $\omega_n$ is of the form $(u_n, x_{n+1})$ with $u_n$ and $x_{n+1}$ elements of $U$ and $S$, respectively. The information spaces $\{\mathbb{H}_n\}_0^\infty$ are recursively generated by $\mathbb{H}_0 := S$ and $\mathbb{H}_{n+1} := \mathbb{H}_n \times U \times S$ for all $n = 0, 1, \ldots$, so that an element $h_n$ in $\mathbb{H}_n$ is uniquely associated with the sample $\omega$ by $h_n := (x_0, \omega_0, \ldots, \omega_{n-1})$ with $h_0 := x_0$. The interpretation of these quantities is as follows: When the sample $\omega = (x_0, \omega_0, \omega_1, \ldots)$ is realized, the system is in state $x_n$ at time $n$, and the control action $u_n$ is generated according to some prespecified mechanism on the basis of the information vector $h_n$.

The coordinate mappings $\{U(n)\}_0^\infty$ and $\{X(n)\}_0^\infty$ are defined on the sample space $\Omega$ by setting $U(n, \omega) := u_n$ and $X(n, \omega) := x_n$, with the information mappings $\{H(n)\}_0^\infty$ given by $H(n, \omega) := (x_0, \omega_0, \omega_1, \ldots, \omega_{n-1}) = h_n$ for every $\omega$ in $\Omega$, and for all $n = 0, 1, \ldots$.

For every $n = 0, 1, \ldots$, let $\mathbb{F}_n$ be the $\sigma$-field generated by the mapping $H(n)$ on the sample space $\Omega$; with standard notation, $\mathbb{F} := \vee_{n=0}^\infty \mathbb{F}_n$ is simply the natural $\sigma$-field on $\Omega$ generated by the mappings $\{(U(n), X(n))\}_0^\infty$. On the space $(\Omega, \mathbb{F})$, the mappings $U(n)$, $X(n)$ and $H(n)$ are random variables (RV) taking values in $U$, $S$ and $\mathbb{H}_n$, respectively.

Let $\mathbb{M}$ denote the space of probability measures on $U$, when equipped with its natural Borel $\sigma$-field. Since randomization is allowed, an admissible policy $\pi$ is defined as any collection $\{\pi_n\}_0^\infty$ of mappings $\pi_n : \mathbb{H}_n \to \mathbb{M}$ such that the mappings $\mathbb{H}_n \to [0, 1] : h_n \to \pi_n(A; h_n)$ are measurable for every Borel subset $A$ of $U$. For each $h_n$ in $\mathbb{H}_n$, the quantity $\pi_n(\cdot; h_n)$ is interpreted as the conditional probability distribution of selecting the control value at time $n$, given that the information vector $h_n$ is available to the decision-maker. In the sequel, denote the collection of all such admissible policies by $\mathcal{P}$.

Let $\mu(\cdot)$ be a fixed probability distribution on $S$. Given any policy $\pi$ in $\mathcal{P}$, there exists a unique probability measure $P^\pi$ on $\mathbb{F}$, with corresponding expectation operator $E^\pi$, satisfying the requirements (R1)-(R3), where

(R1) For all $x_0$ in $S$,

$$P^\pi[X(0) = x_0] := \mu(x_0),$$

(R2) For every Borel subset $A$ of $U$,

$$P^\pi[U(n) \in A \mid \mathbb{F}_n] = \pi_n(A; H(n)), \qquad n = 0, 1, \ldots$$

(R3) For all $y$ in $S$,

$$P^\pi[X(n+1) = y \mid \mathbb{F}_n \vee \sigma(U(n))] = p_{X(n)y}(U(n)), \qquad n = 0, 1, \ldots$$

When $\mu(\cdot)$ is the point mass distribution at $x$ in $S$, this notation is specialized to $P^\pi_x$ and $E^\pi_x$, respectively, and it is then plain that $P^\pi[A \mid X(0) = x] = P^\pi_x[A]$ for every $A$ in $\mathbb{F}$. It now follows readily from (R2)-(R3) that

$$P^\pi[X(n+1) = y \mid \mathbb{F}_n] = \int_U \pi_n(du; H(n))\, p_{X(n)y}(u), \qquad n = 0, 1, \ldots \tag{1.1}$$

for all $y$ in $S$.

A policy $\pi$ in $\mathcal{P}$ is said to be a Markov policy if there exists a family $\{g_n\}_0^\infty$ of mappings $g_n : S \to \mathbb{M}$ such that $\pi_n(\cdot; H(n)) = g_n(\cdot; X(n))$ $P^\pi$-a.s. for all $n = 0, 1, \ldots$. In the event $g_n = g$ for all $n = 0, 1, \ldots$, the Markov policy $\pi$ is called stationary and can be identified with the mapping $g$ itself. It is plain from (R1)-(R3) that for each stationary policy $g$, the RV’s $\{X(n)\}_0^\infty$ form a time-homogeneous Markov chain under $P^g$, with corresponding one-step transition probability matrix $P(g) = (p_{xy}(g))$ given by

$$p_{xy}(g) := \int_U p_{xy}(u)\, g(du; x) \tag{1.2}$$

for all $x$ and $y$ in $S$.
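When the state and action sets are both finite, the integral in (1.2) reduces to a finite sum over actions. The following minimal sketch illustrates this on a hypothetical two-state, two-action kernel; the arrays `p` and `g` and the function name are illustrative assumptions, not objects from the paper.

```python
def induced_transition_matrix(p, g):
    """Finite-sum version of (1.2).

    p[u][x][y] : probability of moving from x to y under action u (assumed finite sets)
    g[x][u]    : probability that the stationary policy picks action u in state x
    """
    n_actions = len(p)
    n_states = len(g)
    return [
        [sum(g[x][u] * p[u][x][y] for u in range(n_actions)) for y in range(n_states)]
        for x in range(n_states)
    ]


# Hypothetical kernel: two actions, two states.
p = [
    [[0.9, 0.1], [0.5, 0.5]],  # transitions under action 0
    [[0.2, 0.8], [0.5, 0.5]],  # transitions under action 1
]
# Randomize evenly in state 0, always use action 0 in state 1.
g = [[0.5, 0.5], [1.0, 0.0]]

P_g = induced_transition_matrix(p, g)
```

Each row of `P_g` is a convex combination of the corresponding rows of the per-action kernels, so it is again a probability vector.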

A policy $\pi$ in $\mathcal{P}$ is said to be a pure (or non-randomized) policy if there exists a family $\{f_n\}_0^\infty$ of mappings $f_n : \mathbb{H}_n \to U$ such that for every Borel subset $A$ of $U$, $\pi_n(A; H(n)) = 1[f_n(H(n)) \in A]$ $P^\pi$-a.s. for all $n = 0, 1, \ldots$. A pure Markov stationary policy $\pi$ in $\mathcal{P}$ is thus fully characterized by a single mapping $f : S \to U$.

For any mapping $c : S \to \mathbb{R}$, the long-run average cost $J^c(\pi)$ incurred by the admissible policy $\pi$ in $\mathcal{P}$ is defined by

$$J^c(\pi) := \limsup_n \frac{1}{n} E^\pi\Bigl[\sum_{t=0}^{n-1} c(X(t))\Bigr] \tag{1.3}$$

whenever meaningful, and for future reference, introduce the corresponding sample average costs $\{J_c(n)\}_1^\infty$ which are given by

$$J_c(n) := \frac{1}{n} \sum_{t=0}^{n-1} c(X(t)), \qquad n = 1, 2, \ldots \tag{1.4}$$
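The sample average costs in (1.4) can be computed in one pass along an observed state trajectory. A minimal sketch follows; the function name, the example path and the cost mapping are hypothetical choices made for illustration only.

```python
def sample_average_costs(path, cost):
    """Return [J_c(1), ..., J_c(N)] for the observed path X(0), ..., X(N-1)."""
    averages = []
    running_sum = 0.0
    for n, state in enumerate(path, start=1):
        running_sum += cost(state)          # sum of c(X(t)) for t = 0, ..., n-1
        averages.append(running_sum / n)    # J_c(n) as in (1.4)
    return averages


# Hypothetical path on states {0, 1} with cost c(x) = x.
path = [0, 1, 1, 0]
averages = sample_average_costs(path, lambda x: float(x))
```

Note that $J_c(n)$ only involves the states visited strictly before time $n$, which is why the list has one entry per observed state.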


2.  Implementation  via  Steering  Policies 

2.1.  The  problem 

Start with a mapping $d : S \to \mathbb{R}$, and let the constant $V$ represent the desired performance level for the long-run average cost (1.3) associated with $d$. The discussion assumes the existence of two stationary policies $\bar g$ and $\underline g$ such that

$$J^d(\underline g) < V < J^d(\bar g). \tag{2.1}$$

The motivation for such an assumption can be found in the theory of constrained MDP’s (Ma, Makowski and Shwartz 1986, Ross 1985). The problem $(P_V)$ of interest in this paper is then formulated as

$(P_V)$ : Find a policy $\alpha$ in $\mathcal{P}$ such that $J^d(\alpha) = V$.

Under the condition (2.1), several solutions to this problem are known and are briefly surveyed in (Makowski and Shwartz 1986a). However, as pointed out there, some of these solutions may not be readily implementable given available model and feedback information. It is the purpose of this paper to present and analyze yet another way to solve the problem $(P_V)$, the key feature of the proposed solution being the minimal amount of information required for its implementation.

2.2.  Steering  policies 

The policy $\alpha$ proposed here is of the form

$$\alpha_n(\cdot; H(n)) := \eta(n)\, \bar g(\cdot; X(n)) + (1 - \eta(n))\, \underline g(\cdot; X(n)), \qquad n = 0, 1, \ldots \tag{2.2}$$

where $\{\eta(n)\}_0^\infty$ is a sequence of $\{0, 1\}$-valued RV’s to be specified shortly. In other words, the policy $\alpha$ alternates between the two policies $\bar g$ and $\underline g$, with the quantity $\eta(n)$ specifying which one of these two policies is to be used in the time slot $[n, n+1)$.

The policy $\alpha$ proposed and analyzed in this paper finds its origin in an idea proposed by Ross (1985, p. 126) in the context of an optimal constrained resource allocation problem. In order to steer the long-run average cost to the requested value $V$, Ross suggested a scheme whereby the decision-maker alternates between the policies $\bar g$ and $\underline g$ so that the sample averages $\{J_d(n)\}_1^\infty$ track the value $V$. This policy, denoted hereafter by $\alpha_R$, also has the form (2.2) but uses a sequence $\{\eta_R(n)\}_0^\infty$ given by

$$\eta_R(n) = 1[J_d(n) < V], \qquad n = 1, 2, \ldots \tag{2.3}$$

with $\eta_R(0)$ arbitrary in $\{0, 1\}$. This idea was subsequently adapted by Makowski and Shwartz (1986a), and by Ross (1988), to a more general class of MDP’s.

The analysis of the performance of the policy $\alpha_R$ appears quite involved, and to the authors’ knowledge, few results are available on this issue. While the general case (Makowski and Shwartz 1986a, Ross 1985) is still open, various special situations have been handled successfully. Makowski (1987) treated the i.i.d. case by viewing the sample averages $\{J_d(n)\}_1^\infty$ as the output values of a stochastic approximation algorithm of the Robbins-Monro type. Ma (1988), in his Ph.D. dissertation, solved the problem in the context of a simple flow control problem for discrete-time M/M/1 systems. A careful examination of the analysis carried out in (Ma 1988) leads very naturally to the policy $\alpha$ investigated in this paper.

The definition of the policy $\alpha$ will require that the assumption (A1) be enforced, namely

(A1) The Markov chain $\{X(n)\}_0^\infty$ has a single recurrent class under each one of the policies $\bar g$ and $\underline g$. These recurrent classes have a non-empty intersection, and moreover, starting from any transient state (if any), the time to absorption in the recurrent class is a.s. finite under each policy.

Let $z$ denote any state in $S$ which is recurrent under both $\bar g$ and $\underline g$. By virtue of assumption (A1), such a state $z$ clearly exists and has the property that the system returns to it infinitely often under each policy. The RV’s $\{\eta(n)\}_0^\infty$ entering the definition of $\alpha$ are recursively generated by the simple relation

$$\eta(n) = 1[X(n) = z]\, 1[J_d(n) < V] + 1[X(n) \ne z]\, \eta(n-1), \qquad n = 1, 2, \ldots \tag{2.4}$$

with $\eta(0)$ arbitrary in $\{0, 1\}$. This policy $\alpha$ operates according to one of the policies $\bar g$ and $\underline g$ during each cycle, where a cycle is defined as the time duration between two consecutive visits of the process $\{X(n)\}_0^\infty$ to the recurrent state $z$. The essential difference between the policies $\alpha$ and $\alpha_R$ is that although both track the sample cost averages $\{J_d(n)\}_1^\infty$ about the value $V$, the decision to switch policies may be taken at every time instant under $\alpha_R$, while only at successive recurrence times (to the state $z$) under $\alpha$.
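The recursion (2.4) can be sketched as a short simulation. The example below is a toy illustration under stated assumptions, not the authors' implementation: it uses a hypothetical two-state chain with $z = 0$, cost $d(x) = x$, and two stationary policies that differ only in the probability of leaving state 0. All names (`make_step`, `simulate_steering`) and all numerical values are invented for the example.

```python
import random


def make_step(p_up, p_down):
    """One-step transition of a hypothetical two-state chain on {0, 1}."""
    def step(x, rng):
        if x == 0:
            return 1 if rng.random() < p_up else 0
        return 0 if rng.random() < p_down else 1
    return step


def simulate_steering(step_high, step_low, cost, z, target, horizon, x0=0, seed=0):
    """Run the steering rule (2.4); return (J_d(horizon), p(horizon))."""
    rng = random.Random(seed)
    x = x0
    eta = 1             # eta(0) is arbitrary in {0, 1}
    total_cost = 0.0    # sum of cost(X(t)) over t < n, so total_cost / n = J_d(n)
    time_high = 0       # number of slots t < n with eta(t) = 1
    for n in range(horizon):
        if n >= 1 and x == z:
            # Switching decisions are taken only at visits to z.
            eta = 1 if total_cost / n < target else 0
        time_high += eta
        total_cost += cost(x)
        x = (step_high if eta == 1 else step_low)(x, rng)
    return total_cost / horizon, time_high / horizon


# Hypothetical policies: the "high" one leaves state 0 more often than the "low" one.
high = make_step(p_up=0.8, p_down=0.5)   # stationary average cost 0.8 / 1.3
low = make_step(p_up=0.2, p_down=0.5)    # stationary average cost 0.2 / 0.7
V = 0.45
avg_cost, frac_high = simulate_steering(high, low, float, z=0, target=V, horizon=200_000)

J_high, J_low = 0.8 / 1.3, 0.2 / 0.7
p_star = (V - J_low) / (J_high - J_low)
```

In this toy setting the returned sample average should settle close to the target `V`, and the fraction of time spent under the overshooting policy should be close to `p_star`; the sketch makes no claim about convergence rates.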

2.3.  The  assumptions 

Let $g$ denote any one of the policies $\bar g$ and $\underline g$, unless otherwise specified. The first return time to the state $z$ is the RV $T$ defined by

$$T := \inf\{n \ge 1 : X(n) = z\}. \tag{2.5}$$

The assumption (A1) essentially amounts to saying that for all $x$ in $S$,

$$P^g_x[T < \infty] = 1.$$

For any mapping $c : S \to \mathbb{R}$, it will be convenient to define the corresponding cost per cycle $Z_c$ by

$$Z_c := \sum_{t=0}^{T-1} c(X(t)) \tag{2.6}$$

(whenever meaningful); observe that under (A1) the RV $Z_c$ is $P^g$-a.s. well defined and finite. In order to study the performance of the policy $\alpha$, the following additional technical assumptions (A2)-(A4) will be needed, where

(A2) The mean recurrence time to the state $z$ is finite under $P^g$, i.e.,

$$E^g_z[T] < \infty,$$

(A3) The expected cost over a cycle $Z_d$ is finite under $P^g$, i.e.,

$$E^g_z[|Z_d|] = E^g_z\Bigl[\Bigl|\sum_{t=0}^{T-1} d(X(t))\Bigr|\Bigr] < \infty,$$

(A4) The equality

$$J^d(g) := \lim_n \frac{1}{n} E^g\Bigl[\sum_{t=0}^{n-1} d(X(t))\Bigr] = I^d(g)$$

takes place, where

$$I^d(g) := \frac{E^g_z[Z_d]}{E^g_z[T]}. \tag{2.7}$$


The assumption (A2) implies that the state $z$ is positive recurrent under $P^g$, whence the Markov chain $\{X(n)\}_0^\infty$ under $P^g$ has a unique invariant measure. Under (A1)-(A3), the renewal arguments given in Chung (1967) show that the sequence $\{J_d(n)\}_1^\infty$ has a $P^g$-a.s. finite limit which is given by

$$\lim_n J_d(n) = \frac{E^g_z[Z_d]}{E^g_z[T]} = I^d(g) \qquad P^g\text{-a.s.} \tag{2.8}$$

Moreover, if the RV’s $\{d(X(n))\}_0^\infty$ are uniformly integrable under $P^g$, then (A4) is automatically guaranteed since then the convergence (2.8) also holds in $L^1(\Omega, \mathbb{F}, P^g)$ (Chung 1967, Thm. 4.5.4). In that case, the quantity $I^d(g)$ also coincides with the expected value of the RV $d(X)$ under the invariant measure induced by $P^g$, where $X$ denotes a generic $S$-valued RV.
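The ratio in (2.7)-(2.8) can be checked numerically on a small example by simulating independent cycles that start and end at $z$. The sketch below assumes a hypothetical two-state chain with $z = 0$ and cost $d(x) = x$, for which the invariant measure gives state 1 the mass $a/(a+b)$, where $a$ and $b$ are the (assumed) probabilities of leaving state 0 and state 1. The function names are illustrative.

```python
import random


def make_step(p_up, p_down):
    """One-step transition of a hypothetical two-state chain on {0, 1}."""
    def step(x, rng):
        if x == 0:
            return 1 if rng.random() < p_up else 0
        return 0 if rng.random() < p_down else 1
    return step


def cycle_ratio(step, cost, z, n_cycles, seed=0):
    """Monte Carlo estimate of E_z[Z_d] / E_z[T] from n_cycles regeneration cycles."""
    rng = random.Random(seed)
    total_time = 0
    total_cost = 0.0
    for _ in range(n_cycles):
        x = z
        while True:
            total_cost += cost(x)   # accumulates d(X(t)) for t = 0, ..., T-1
            total_time += 1
            x = step(x, rng)
            if x == z:              # first return to z ends the cycle
                break
    return total_cost / total_time


a, b = 0.8, 0.5
estimate = cycle_ratio(make_step(a, b), float, z=0, n_cycles=200_000)
stationary_mean = a / (a + b)       # expected cost under the invariant measure
```

The two quantities should agree up to Monte Carlo error, which is the content of the last sentence above for this particular toy chain.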


3.  The  results 

3.1.  Performance  of  the  steering  policy 

The main results of this paper are stated in Theorems 3.1-3.3 below, and are proved in Section 3.3 using some key intermediate results which are summarized in Section 3.2. To state the results, let the RV $p(n)$ given by

$$p(n) := \frac{1}{n} \sum_{t=0}^{n-1} \eta(t), \qquad n = 1, 2, \ldots \tag{3.1}$$

denote the fraction of time over $[0, n)$ during which the policy $\bar g$ is used. Set

$$p^* := \frac{V - J^d(\underline g)}{J^d(\bar g) - J^d(\underline g)} \tag{3.2}$$

and observe from (2.1) that $0 < p^* < 1$.
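As a quick sanity check on (3.2), consider purely hypothetical values $J^d(\underline g) = 1$, $J^d(\bar g) = 3$ and $V = 1.5$ (these numbers are chosen for illustration and do not come from the paper). The time fraction $p^*$ is exactly the weight that mixes the two average costs to the target:

```latex
\[
p^* = \frac{1.5 - 1}{3 - 1} = 0.25,
\qquad
p^* J^d(\bar g) + (1 - p^*) J^d(\underline g) = 0.25 \cdot 3 + 0.75 \cdot 1 = 1.5 = V .
\]
```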

Theorem 3.1 Under (A1)-(A4), the convergences

$$\lim_n p(n) = p^* \qquad P^\alpha\text{-a.s.} \tag{3.3}$$

and

$$\lim_n J_d(n) = \lim_n \frac{1}{n} \sum_{t=0}^{n-1} d(X(t)) = V \qquad P^\alpha\text{-a.s.} \tag{3.4}$$

take place.

Theorem 3.1 establishes the a.s. convergence of the sample cost averages to the desired value $V$, and the policy $\alpha$ will indeed constitute a solution to the problem $(P_V)$, provided some additional integrability conditions hold to guarantee convergence of the mean. One possible set of conditions is given in the next corollary, which is based on standard facts on uniform integrability (Chung 1967, Thm. 4.5.4).

Corollary 3.1 Under (A1)-(A4), whenever the RV’s $\{d(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$, the convergence (3.4) also takes place in $L^1(\Omega, \mathbb{F}, P^\alpha)$ and consequently

$$J^d(\alpha) = \lim_n \frac{1}{n} E^\alpha\Bigl[\sum_{t=0}^{n-1} d(X(t))\Bigr] = V. \tag{3.5}$$


The convergence (3.4) can also be established for other cost mappings $c : S \to \mathbb{R}$ under the assumption (A3bis) stated below, which is similar to (but weaker than) (A3), provided (3.3) holds. To state the condition, denote by $c^+$ and $c^-$ the mappings $S \to \mathbb{R}$ defined by $c^+(x) := \max(c(x), 0)$ and $c^-(x) := \max(-c(x), 0)$ for all $x$ in $S$.

(A3bis) The cost over a cycle $Z_c$ has a (possibly infinite) expectation under $P^g_z$, or equivalently, the quantities $E^g_z[Z_{c^+}]$ and $E^g_z[Z_{c^-}]$ are not both infinite under $P^g_z$.

Set

$$I^c(g) := \frac{E^g_z[Z_c]}{E^g_z[T]}. \tag{3.6}$$

Theorem 3.2 Assume (3.3) to hold and let the mapping $c : S \to \mathbb{R}$ satisfy (A3bis). If the quantity $I^c(\bar g) + I^c(\underline g)$ is well defined (but possibly infinite), then the convergence

$$\lim_n J_c(n) = p^* I^c(\bar g) + (1 - p^*) I^c(\underline g) \qquad P^\alpha\text{-a.s.} \tag{3.7}$$

takes place.

If the assumption (A3bis) is strengthened to (A3)-(A4) (with $c$ replacing $d$), and if the RV’s $\{c(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$, then the quantity on the right-hand side of (3.7) is finite, and the convergence (3.7) also holds in $L^1(\Omega, \mathbb{F}, P^\alpha)$.

Corollary 3.2 Assume (3.3) to hold and let the mapping $c : S \to \mathbb{R}$ satisfy assumptions (A3)-(A4). If the RV’s $\{c(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$, the convergence (3.7) also takes place in $L^1(\Omega, \mathbb{F}, P^\alpha)$, and consequently

$$J^c(\alpha) = \lim_n \frac{1}{n} E^\alpha\Bigl[\sum_{t=0}^{n-1} c(X(t))\Bigr] = p^* J^c(\bar g) + (1 - p^*) J^c(\underline g). \tag{3.8}$$


This result will be particularly useful in discussing constrained MDP’s in the next section.

Although this paper is devoted essentially to the study of the policy $\alpha$, the results obtained here have implications for the policy $\alpha_R$ in some special yet important cases. Such a situation is discussed in the next proposition.

Theorem 3.3 If the policies $\bar g$ and $\underline g$ coincide in all but one state, say $x_0$ in $S$, which is recurrent under each policy, then the policy $\alpha_R$ coincides with the policy $\alpha$ defined by (2.2) and (2.4) with $z = x_0$, and consequently Theorems 3.1-3.2 and their corollaries hold for the policy $\alpha_R$ under appropriate assumptions.

The situation of Theorem 3.3 occurs in a wide class of constrained MDP’s with finite state space and in some other problems as well, as illustrated in Section 4.

3.2.  Convergence  along  recurrence  times 

The intermediate results which are useful in establishing the main Theorems 3.1-3.2 are summarized in this section. They are motivated by the very form of the steering policy, and represent the main technical ingredients of the paper. A complete discussion of their analysis is delayed until Section 5.

Recall that the very form of the steering policy $\alpha$ forces the decision for switching between policies to be taken only at the times the state process visits the state $z$. This suggests that the behavior of the control algorithm might be fully determined by the properties of the sample average cost sequence taken only along these recurrence epochs.

To that end, consider the state $z$ in $S$ entering the definition (2.4), and recursively define the recurrence time sequence $\{\tau(k)\}_0^\infty$ of $\mathbb{N} \cup \{\infty\}$-valued RV’s by

$$\tau(k+1) := \begin{cases} \inf\{t > \tau(k) : X(t) = z\} & \text{if the set is non-empty;} \\ \infty & \text{otherwise,} \end{cases} \qquad k = 0, 1, \ldots \tag{3.9}$$

where $\tau(0) := 0$. With this notation, the interval $[\tau(k-1), \tau(k))$ is simply the $k$th cycle.
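On a finite observed trajectory, the recursion (3.9) amounts to listing the times $t \ge 1$ at which the path sits at $z$, preceded by $\tau(0) = 0$. A minimal sketch, with hypothetical function names and a made-up path:

```python
def recurrence_times(path, z):
    """Return [tau(0), tau(1), ...] observable on the finite path X(0), ..., X(N-1)."""
    # tau(0) := 0 by convention, whether or not X(0) equals z.
    return [0] + [t for t in range(1, len(path)) if path[t] == z]


def split_into_cycles(path, z):
    """Cut the path into the completed cycles [tau(k-1), tau(k))."""
    taus = recurrence_times(path, z)
    return [path[start:end] for start, end in zip(taus, taus[1:])]


# Hypothetical path on states {0, 1} with privileged state z = 0.
path = [0, 1, 1, 0, 0, 1, 0]
taus = recurrence_times(path, 0)
cycles = split_into_cycles(path, 0)
```

Only completed cycles are returned; the segment after the last observed visit to $z$ is left out because its end point is not yet known.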

The recurrence condition (A1) and the definition of the steering policy $\alpha$ lead readily to the following intuitive fact, the proof of which is omitted for the sake of brevity.

Lemma 3.4 Assume the recurrence condition (A1) to hold. The RV’s $\tau(k)$ are $P^\alpha$-a.s. finite for all $k = 1, 2, \ldots$, or equivalently, the state process $\{X(n)\}_0^\infty$ visits the state $z$ infinitely often under $P^\alpha$. Moreover, under the additional assumptions (A2)-(A4), the steering policy $\alpha$ alternates infinitely often between the two policies $\bar g$ and $\underline g$.

The key intermediate results for proving Theorems 3.1-3.2 are summarized in the next proposition. Set

$$q^* := \frac{p^* E^{\underline g}_z[T]}{(1 - p^*) E^{\bar g}_z[T] + p^* E^{\underline g}_z[T]}. \tag{3.10}$$

Theorem 3.5 Assume (A1)-(A4) to hold. The convergence

$$\lim_k p(\tau(k)) = p^* \qquad P^\alpha\text{-a.s.} \tag{3.11}$$

takes place, and for any mapping $c : S \to \mathbb{R}$ satisfying (A3bis), the convergence

$$\lim_k J_c(\tau(k)) = p^* I^c(\bar g) + (1 - p^*) I^c(\underline g) \qquad P^\alpha\text{-a.s.} \tag{3.12}$$

takes place whenever the quantity $I^c(\bar g) + I^c(\underline g)$ is well defined. Moreover, the Law of Large Numbers holds true in the form

$$\lim_k \frac{\tau(k)}{k} = q^* E^{\bar g}_z[T] + (1 - q^*) E^{\underline g}_z[T] \qquad P^\alpha\text{-a.s.} \tag{3.13}$$

When applying (3.12) to the cost mapping $d$, simple algebraic calculations using (A4), (2.7) and (3.2) readily yield

$$\lim_k J_d(\tau(k)) = V \qquad P^\alpha\text{-a.s.} \tag{3.14}$$

In other words, (3.11)-(3.12) yield the convergences (3.3)-(3.4) and (3.7) along the recurrence times. Although the convergence (3.13) presents a similar version of the Law of Large Numbers, it should be noted that the recurrence times $\{\tau(k)\}_0^\infty$ do not form a renewal sequence under $P^\alpha$.
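The following heuristic computation may help in reading (3.10)-(3.14). It interprets $q^*$ as the long-run fraction of cycles run under $\bar g$; this interpretation is an assumption made here for illustration and is not stated explicitly in the text. The symbols are those of (2.7), (3.2) and (3.10).

```latex
% If a fraction q of the cycles uses \bar g, the fraction of time spent under \bar g is
\[
p \;=\; \frac{q\,E^{\bar g}_z[T]}{q\,E^{\bar g}_z[T] + (1-q)\,E^{\underline g}_z[T]}
\quad\Longleftrightarrow\quad
q \;=\; \frac{p\,E^{\underline g}_z[T]}{(1-p)\,E^{\bar g}_z[T] + p\,E^{\underline g}_z[T]},
\]
% which is (3.10) when p = p^*. For the cost mapping d, (A4) gives I^d(g) = J^d(g),
% so that (3.12) combined with (3.2) yields
\[
p^* J^d(\bar g) + (1 - p^*) J^d(\underline g)
\;=\; J^d(\underline g) + p^* \bigl( J^d(\bar g) - J^d(\underline g) \bigr)
\;=\; V .
\]
```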

3.3. A proof of Theorems 3.1-3.2

The proof of Theorems 3.1-3.2 is now easily recovered from Theorem 3.5. Let

$$k(n) := \max\{k \ge 0 : \tau(k) \le n\}, \qquad n = 1, 2, \ldots \tag{3.15}$$

be the number of cycles over the horizon $[0, n)$ including the one in progress at time $n$. It is plain from Lemma 3.4 that

$$\lim_n k(n) = \infty \qquad P^\alpha\text{-a.s.} \tag{3.16}$$

For each $n = 1, 2, \ldots$, $\tau(k(n)) \le n < \tau(k(n) + 1)$, so that for any non-negative mapping $c : S \to \mathbb{R}$,

$$\frac{\tau(k(n))}{n} J_c(\tau(k(n))) \le J_c(n) \le \frac{\tau(k(n) + 1)}{n} J_c(\tau(k(n) + 1)), \tag{3.17}$$

and similarly,

$$\frac{\tau(k(n))}{n} p(\tau(k(n))) \le p(n) \le \frac{\tau(k(n) + 1)}{n} p(\tau(k(n) + 1)). \tag{3.18}$$

By the Law of Large Numbers (3.13), it is clear that

$$\lim_k \frac{\tau(k)}{\tau(k+1)} = \lim_k \frac{\tau(k)}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{\tau(k+1)} = 1 \qquad P^\alpha\text{-a.s.} \tag{3.19}$$

Since

$$\frac{\tau(k(n))}{\tau(k(n) + 1)} \le \frac{\tau(k(n))}{n} \le 1, \tag{3.20}$$

it is now plain from (3.16) and (3.19)-(3.20) that

$$\lim_n \frac{\tau(k(n))}{n} = \lim_n \frac{\tau(k(n) + 1)}{n} = 1 \qquad P^\alpha\text{-a.s.} \tag{3.21}$$

By virtue of (3.11)-(3.12), the inequalities (3.17)-(3.18) and the convergence (3.21) yield the convergences (3.3), (3.4) and (3.7) for non-negative mappings.

For a general cost mapping $c$, start with the decomposition $J_c(n) = J_{c^+}(n) - J_{c^-}(n)$ for all $n = 1, 2, \ldots$, and apply the result for non-negative mappings developed above, so that

$$\lim_n J_{c^+}(n) = p^* I^{c^+}(\bar g) + (1 - p^*) I^{c^+}(\underline g) \qquad P^\alpha\text{-a.s.} \tag{3.22a}$$

and

$$\lim_n J_{c^-}(n) = p^* I^{c^-}(\bar g) + (1 - p^*) I^{c^-}(\underline g) \qquad P^\alpha\text{-a.s.} \tag{3.22b}$$

It is now plain under the enforced assumptions that

$$\lim_n J_c(n) = \lim_n J_{c^+}(n) - \lim_n J_{c^-}(n) = p^* I^c(\bar g) + (1 - p^*) I^c(\underline g) \qquad P^\alpha\text{-a.s.} \tag{3.23}$$

and the proof of Theorems 3.1-3.2 is therefore complete.


4.  Applications  to  Constrained  MDP’s 

This section is devoted to various applications of Theorems 3.1-3.3 and their corollaries to constrained MDP’s. Let $c$ and $d$ be two mappings $S \to \mathbb{R}$ and for every $V$ in $\mathbb{R}$, define the set $\mathcal{P}_V$ of constrained policies by

$$\mathcal{P}_V := \{\pi \text{ in } \mathcal{P} : J^d(\pi) \le V\}. \tag{4.1}$$

The constrained MDP $(CP_V)$ is then formulated as

$(CP_V)$ : Minimize $J^c(\pi)$ over $\mathcal{P}_V$.

In the three situations discussed here, this constrained MDP is solved by Lagrangian arguments: For every $\gamma \ge 0$, define the mapping $b^\gamma : S \to \mathbb{R} : x \to b^\gamma(x) := c(x) + \gamma d(x)$, and consider the corresponding unconstrained Lagrangian problem $(LP_\gamma)$, where

$(LP_\gamma)$ : Minimize $J^{b^\gamma}(\pi)$ over $\mathcal{P}$.

In each example, under appropriate hypotheses, there exist two pure stationary policies $\bar g$ and $\underline g$ which both solve the same Lagrangian problem $(LP_{\gamma^*})$ for some $\gamma^* > 0$, i.e.,

$$J^{b^{\gamma^*}}(\underline g) = J^{b^{\gamma^*}}(\bar g) = \inf_{\pi \in \mathcal{P}} J^{b^{\gamma^*}}(\pi), \tag{4.2}$$

and which satisfy the cost inequalities

$$J^d(\underline g) < V < J^d(\bar g). \tag{4.3}$$

For $0 \le \eta \le 1$, let the randomized policy $f^\eta$ be the stationary policy defined by $f^\eta := \eta \bar g + (1 - \eta) \underline g$. If the mapping $\eta \to J^d(f^\eta)$ is continuous, the equation

$$J^d(f^\eta) = V, \qquad 0 \le \eta \le 1 \tag{4.4}$$

has at least one solution, say $\eta^*$, in view of (4.3). The constrained problem $(CP_V)$ is then solved by the stationary policy $f^* = f^{\eta^*}$, provided (i) $f^*$ solves the Lagrangian problem $(LP_{\gamma^*})$ and (ii) both functionals $J^c(f^*)$ and $J^d(f^*)$ exist as limits, so that

$$J^{b^{\gamma^*}}(f^*) = J^c(f^*) + \gamma^* J^d(f^*) = \inf_{\pi \in \mathcal{P}} J^{b^{\gamma^*}}(\pi). \tag{4.5}$$

In that case,

$$J^d(f^*) = V \quad \text{and} \quad J^c(f^*) = \inf_{\pi \in \mathcal{P}_V} J^c(\pi) \tag{4.6}$$

by standard arguments which are summarized in (Ma, Makowski and Shwartz 1986, Ross 1985).
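When the mapping $\eta \to J^d(f^\eta)$ is continuous and monotone, equation (4.4) can be solved numerically by bisection. The sketch below assumes a hypothetical two-state chain in which the two pure policies differ only in state 0, with cost $d(x) = x$, so that the average cost under $f^\eta$ has the closed form $a(\eta)/(a(\eta) + b)$. All names and numerical values are illustrative assumptions. Note that this computation needs the model parameters, which is exactly the knowledge the steering policy does not require.

```python
def average_cost(eta, up_high=0.8, up_low=0.2, down=0.5):
    """J^d(f^eta) for a hypothetical two-state chain with cost d(x) = x.

    In state 0 the mixed policy leaves with probability a(eta); state 1 is
    left with probability `down` under both pure policies.
    """
    a = eta * up_high + (1.0 - eta) * up_low
    return a / (a + down)


def solve_bias(target, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for eta in [lo, hi] with average_cost(eta) = target.

    Assumes average_cost is continuous and increasing on [lo, hi] and that
    average_cost(lo) < target < average_cost(hi), as in (4.3).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if average_cost(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


V = 0.45
eta_star = solve_bias(V)
```

In this toy example the randomization bias $\eta^*$ differs from the time fraction $p^*$ of (3.2), since $\eta^*$ weights the action choice in a single state whereas $p^*$ weights entire cycles.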

Consider now the steering policy $\alpha$ defined in Section 2. Under (A1)-(A4), whenever the RV’s
$\{d(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$, Corollary 3.1 and (4.6) yield $J^d(\alpha) = J^d(f^*) = V$
and (3.3) holds by Theorem 3.1. Consequently, if the mapping $c$ also satisfies (A3)-(A4) (thus so
does the mapping $b^{\gamma^*}$) and the RV’s $\{c(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$ (thus so are
the RV’s $\{b^{\gamma^*}(X(n))\}_0^\infty$), then Corollary 3.2 necessarily implies the relation $J^c(\alpha) = J^c(f^*)$. These
remarks are summarized in the next proposition.

Theorem 4.1 Suppose the problem $(CP_V)$ admits a solution $f^*$ as determined via (4.2)-(4.5).
Under (A1)-(A2), if (A3)-(A4) hold for both mappings $c$ and $d$, and if the RV’s $\{c(X(n))\}_0^\infty$ and
$\{d(X(n))\}_0^\infty$ are uniformly integrable under $P^\alpha$, then the steering policy $\alpha$ also solves the problem
$(CP_V)$, with $J^c(\alpha) = J^c(f^*)$ and $J^d(\alpha) = J^d(f^*) = V$.

4.1.  MDP’s  with  finite  state  spaces 

Beutler and Ross (1985) considered MDP’s under the following assumptions (H1)-(H3), where

(H1) The state space $S$ is finite, and the action space $U$ is a compact metric space,

(H2) For every pure stationary policy $f$, there exists a common state $z$ in $S$ which is accessible
from each state $x$ in $S$ under $P^f$,

(H3) The set $\mathcal{P}_V$ contains at least one pure stationary policy, but does not contain any pure
stationary policy which achieves the minimum cost $J^c(\pi)$ over all admissible policies $\pi$ in
$\mathcal{P}$.

Under (H1)-(H3), an optimal policy $f^*$ was shown to be determined via (4.2)-(4.5), with the
randomization to be performed in only one particular state, i.e., the pure policies $\overline{g}$ and $\underline{g}$ coincide
in all but one state. As shown by Beutler and Ross (1985), (H2) holds for all randomized stationary
policies as well, thus implying (A1). The state space being finite, the costs are necessarily bounded,
so that the assumptions of Theorem 4.1 are immediately satisfied, and the optimality of the steering
policies $\alpha$ and $\alpha_R$ easily follows.
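For a given finite transition matrix, the accessibility requirement in (H2) can be checked mechanically. The sketch below is only an illustration under stated assumptions: the function name `common_accessible_states` and the list-of-rows matrix layout are hypothetical, and the check is a plain backward graph search rather than anything taken from Beutler and Ross (1985).

```python
from collections import deque


def common_accessible_states(P):
    """Return the states z that are accessible from every state under P.

    P is a list of rows; P[x][y] > 0 means a one-step move x -> y is possible.
    """
    n = len(P)
    result = []
    for z in range(n):
        # Backward breadth-first search: collect every state that can reach z.
        seen = {z}
        queue = deque([z])
        while queue:
            y = queue.popleft()
            for x in range(n):
                if P[x][y] > 0 and x not in seen:
                    seen.add(x)
                    queue.append(x)
        if len(seen) == n:
            result.append(z)
    return result
```

Any state returned by this check is a candidate for the reference state $z$ used by the steering policy, provided the same state works for both pure policies.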

Theorem 4.2 Under (H1)-(H3), the steering policies $\alpha$ and $\alpha_R$ (coincide and) solve the constrained
problem $(CP_V)$ with $J^c(\alpha) = J^c(\alpha_R) = J^c(f^*)$ and $J^d(\alpha) = J^d(\alpha_R) = J^d(f^*) = V$.

4.2.  Optimal  flow  control 

Ma and Makowski (1987) considered the following flow control model for discrete-time $M|M|1$
queues: At the beginning of each time slot, the controller decides either to admit or reject the
potential arrival during that slot. A customer (if any) may fail to complete service in a slot with
fixed probability $1 - \mu$, in which case it remains at the head of the line to await service in the
next slot. This scenario is repeated until successful service completion occurs, at which time the
customer leaves the system. The arrival pattern is modelled as a Bernoulli sequence with parameter
$\lambda$, independent of the service process as well as of the initial queue size. Under these assumptions, an
MDP formulation with state space $S = \mathbb{N}$ is readily obtained by taking the state process $\{X(n)\}_0^\infty$
to be the queue size process.

The optimal flow control problem was formulated as the search for a policy that maximizes
the throughput subject to the constraint that the long-run average queue size does not exceed a
given value $V$. Here, the throughput and the average queue size incurred by the admissible policy
$\pi$ in $\mathcal{P}$ are given by

$$T(\pi) := \liminf_n \frac{1}{n} E^\pi \sum_{t=0}^{n-1} \mu\, \mathbf{1}[X(t) \ne 0] \tag{4.7}$$


and

$$N(\pi) := \limsup_n \frac{1}{n} E^\pi \sum_{t=0}^{n-1} X(t), \tag{4.8}$$

respectively.

This constrained MDP can be cast as a problem of the form $(CP_V)$ by taking $J^c(\pi) = -T(\pi)$
and $J^d(\pi) = N(\pi)$. The technical assumptions enforced in (Ma and Makowski 1987) are listed
below as (H4)-(H5), where

(H4) $N((\infty, 1)) > V$, where $(\infty, 1)$ denotes the policy that admits every single customer,

(H5) For every policy $\pi$ in $\mathcal{P}$,

$$E^\pi[X(0)] = \sum_{x=0}^{\infty} x\, p(x) < \infty,$$

with $p$ denoting the initial queue size distribution.

It is a simple matter to check (Ma and Makowski 1987) that the RV’s $\{X(n)\}_0^\infty$ are uniformly
integrable under $P^\alpha$. Under these assumptions, the constrained optimal control problem is solved
by a threshold policy $f^* = (L^*, \eta^*)$ with $N(f^*) = V$. Here, a threshold policy $(L, \eta)$, with $L$ in $\mathbb{N}$
and $0 \le \eta \le 1$, is a stationary policy which at the beginning of each time slot admits (resp. rejects)
an incoming customer if the queue size is $< L$ (resp. $> L$), while if the queue size is exactly $L$, this
new customer is accepted (resp. rejected) with probability $\eta$ (resp. $1 - \eta$).
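The admission rule of a threshold policy $(L, \eta)$ described above can be written as a short function. This is a minimal sketch; the function name `admit` and the optional `rng` argument are illustrative choices, not notation from the paper.

```python
import random


def admit(queue_size, L, eta, rng=random):
    """Admission decision of the threshold policy (L, eta) for one slot."""
    if queue_size < L:
        return True   # below the threshold: always admit
    if queue_size > L:
        return False  # above the threshold: always reject
    # Exactly at the threshold: admit with probability eta.
    return rng.random() < eta
```

With `eta` equal to 1 or 0 the rule is deterministic, which corresponds to the two pure policies $(L, 1)$ and $(L, 0)$.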

It should be pointed out that here too the optimal threshold policy $f^* = (L^*, \eta^*)$ is obtained
as a randomization with bias $\eta^*$ between the pure policies $\overline{g} = (L^*, 1)$ and $\underline{g} = (L^*, 0)$, which
are identical in all but one state, the state where there are $L^*$ customers in the system. The
assumptions of Theorem 4.1 now hold. The states $\{0, 1, \ldots, L^*\}$ are all recurrent under the policies
$\overline{g}$ and $\underline{g}$ so that any element in the set $\{0, 1, \ldots, L^*\}$ can be selected as the state $z$. The optimality
of the corresponding steering policy $\alpha$ now follows immediately.

Theorem 4.3 Under (H4)-(H5), the steering policies $\alpha$ and $\alpha_R$ (coincide and) solve the constrained
optimal control problem $(CP_V)$, with $T(\alpha) = T(\alpha_R) = T(f^*)$ and $N(\alpha) = N(\alpha_R) = N(f^*) = V$.
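The behavior of a steering policy on this queue can be explored by simulation. The sketch below rests on several assumptions that are not taken from the source: within a slot, an admitted arrival (probability `lam`) and a departure (probability `mu`, only if the queue is non-empty) occur independently; the policy is re-selected at every visit to the reference state `z` by comparing the running average queue size with `V`; and the parameter values are illustrative. The function name `simulate_steering` is hypothetical.

```python
import random


def simulate_steering(lam, mu, L, V, z=0, slots=200_000, seed=1):
    """Simulate steering between (L, 1) and (L, 0); return summary statistics.

    Returns (average queue size, fraction of slots under the policy (L, 1)).
    """
    rng = random.Random(seed)
    x = z
    total = 0        # running sum of queue sizes X(0) + ... + X(n-1)
    eta = 1          # 1: policy (L, 1) in force; 0: policy (L, 0) in force
    used_upper = 0   # number of slots in which (L, 1) was in force
    for n in range(slots):
        if x == z and n > 0:
            # A visit to z starts a new cycle: above target, switch to the
            # policy with the smaller average queue size, else to the larger.
            eta = 0 if total / n > V else 1
        used_upper += eta
        total += x
        # Threshold rule: admit below L, reject above L, follow eta at L.
        admitted = x < L or (x == L and eta == 1)
        arrival = 1 if (admitted and rng.random() < lam) else 0
        departure = 1 if (x > 0 and rng.random() < mu) else 0
        x += arrival - departure
    return total / slots, used_upper / slots


avg_queue, upper_fraction = simulate_steering(lam=0.4, mu=0.5, L=3, V=1.3)
```

Under the assumed dynamics and these illustrative parameters, the target `V = 1.3` lies between the average queue sizes of the two pure threshold policies, so both policies are used and the sample average settles near the target. No model parameter enters the switching rule itself, only the observed queue sizes.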

4.3.  Optimal  resource  allocation 

Consider a system of $K + 1$ infinite-capacity queues that compete in discrete time for the service
attention of a single server. At the beginning of each time slot, the controller gives priority to one
of the queues. If the $k$th queue is given service attention during that slot, with probability $\mu_k$ the
serviced customer (if any) completes service and leaves the system, while with probability $1 - \mu_k$,
the customer fails to complete service and remains in the queue. The arrival pattern $\{A(n)\}_0^\infty$
of $\mathbb{N}^{K+1}$-valued RV’s, with $A_k(n)$ denoting the number of arrivals to the $k$th queue in the slot
$[n, n+1)$, is independent of the initial queue size and of the service processes, and is modelled as
a renewal process, in that the batch sizes of customers arriving into the system in each slot are
independent and identically distributed from slot to slot. Under these assumptions, the MDP of
interest is modelled by the $\mathbb{N}^{K+1}$-valued process $\{X(n)\}_0^\infty$, where $X_k(n)$, $0 \le k \le K$, represents
the queue size of the $k$th queue at the beginning of the slot $[n, n+1)$, $n = 0, 1, \ldots$

Nain and Ross (1986) identified the service allocation policy that minimizes the long-run
average of a linear expression in the queue sizes of the $K$ queues $\{1, \ldots, K\}$ subject to the constraint
that the long-run average queue size of the 0th queue does not exceed a given value $V$. With the
notation used here, they considered the constrained problem $(CP_V)$ with cost functionals

$$J^c(\pi) := \limsup_n \frac{1}{n} E^\pi \sum_{t=0}^{n-1} \sum_{k=1}^{K} c_k X_k(t), \tag{4.9}$$

and

$$J^d(\pi) := \limsup_n \frac{1}{n} E^\pi \sum_{t=0}^{n-1} X_0(t), \tag{4.10}$$

where $c_k$, $1 \le k \le K$, are non-negative weights.


A work conserving static priority assignment policy is a non-idling service allocation policy
with fixed priority. With this notation, the results are given under the following assumptions
(H6)-(H8), where

(H6) The stability condition

$$\rho := \sum_{k=0}^{K} \frac{\lambda_k}{\mu_k} < 1$$

holds, where $\lambda_k := E^\pi[A_k(n)]$ for all $\pi$ in $\mathcal{P}$ and every $n = 0, 1, \ldots$,

(H7) The set $\mathcal{P}_V$ contains at least one work conserving static priority assignment policy that
gives the highest priority to the 0th queue, but does not contain any work conserving static
priority assignment policy which gives the lowest priority to the 0th queue,

(H8) For some $r > 2$, the finite moment conditions

$$\sum_{k=0}^{K} E^\pi |X_k(0)|^r < \infty \quad \text{and} \quad \sum_{k=0}^{K} E^\pi |A_k(n)|^r < \infty$$

hold for all $\pi$ in $\mathcal{P}$ and every $n = 0, 1, \ldots$.

Under (H6)-(H7), the problem $(CP_V)$ admits (Nain and Ross 1986) an optimal stationary
policy $f^*$ which is obtained by simple randomization between two work conserving static priority
assignment policies $\overline{g}$ and $\underline{g}$, as determined by (4.2)-(4.5). Under (H8), Makowski and Shwartz
(1986b) have shown that the RV’s $\{X(n)\}_0^\infty$ are uniformly integrable under $P^\pi$ for any non-idling
policy $\pi$ in $\mathcal{P}$. Moreover, for any non-idling stationary policy $g$, the Markov chain $\{X(n)\}_0^\infty$ forms
a single ergodic class under $P^g$ over the state space $\mathbb{N}^{K+1}$. These facts imply readily that under
(H6)-(H8), Theorem 4.1 applies to the steering policy $\alpha$ defined by (2.2) and (2.4), where the state
$z$ is chosen arbitrarily in $\mathbb{N}^{K+1}$.

Theorem 4.4 Under (H6)-(H8), the steering policy $\alpha$ solves the constrained optimal resource
allocation problem $(CP_V)$ with $J^c(\alpha) = J^c(f^*)$ and $J^d(\alpha) = J^d(f^*) = V$.

For $K = 1$, the system is composed of two queues and the policy $\overline{g}$ (resp. $\underline{g}$) specializes to
the work conserving static priority assignment policy giving higher priority to the 1st queue (resp.
the 0th queue). In that case, the steering policy $\alpha$ constitutes an adaptive policy in the restrictive
sense understood in the literature of adaptive control of Markov chains (Kumar and Varaiya 1986)
in that no knowledge of the model parameters is needed for implementing the policy $\alpha$.
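For the two-queue case, the ingredients of such a parameter-free implementation can be sketched as follows. The names `serve_index` and `steer_priority` are hypothetical, and the switching rule (favor queue 0 when its sample average exceeds $V$) is an assumption about how the steering comparison specializes here, not a transcription of the definition in Section 2.

```python
def serve_index(queue_sizes, priority):
    """Index of the queue served under a non-idling static priority rule.

    priority lists queue indices from highest to lowest priority. The result
    is None only when every queue is empty, so the server never idles
    while work is present.
    """
    for k in priority:
        if queue_sizes[k] > 0:
            return k
    return None


def steer_priority(sample_avg_queue0, V):
    """Pick the priority order for the next cycle when K = 1 (two queues)."""
    # Above target: favor queue 0 to pull its average down.
    # Otherwise: favor queue 1.
    return (0, 1) if sample_avg_queue0 > V else (1, 0)
```

Neither function uses the arrival or service rates, which is the sense in which the rule needs only observed queue sizes.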

5.  A  Proof  of  Theorem  3.5  by  Sample  Path  Arguments 

In this section, Theorem 3.5 is established through direct sample path arguments. The discussion
is carried out through a series of technical lemmas.

5.1. Regenerative properties of $\{X(n)\}_0^\infty$

To study the performance of the policy $\alpha$, start with the following observation: Under the
recurrence assumption (A1), the process $\{X(n)\}_0^\infty$ is regenerative under each one of the measures
$P^{\overline{g}}$ and $P^{\underline{g}}$ (Chung 1974), while it need not be so under $P^\alpha$ owing to its non-stationarity. It
thus seems reasonable to try a decomposition of this non-stationary process into two regenerative
processes. This is done by connecting together the cycles corresponding to the use of each one of
the policies so that results from the theory of regenerative processes may be applied. This idea is
made precise in the lemma below and the arguments that follow it.

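Let $\overline{t}(m)$ (resp. $\underline{t}(m)$) be the left boundary of the slot during which the policy $\overline{g}$ (resp. $\underline{g}$) is
used for the $m$th time so that $\eta(\overline{t}(m)) = 1$ (resp. $\eta(\underline{t}(m)) = 0$). Note that the RV’s $\overline{t}(m)$ and $\underline{t}(m)$
are $\mathbb{F}_n$-stopping times, and the RV’s $\overline{X}(m)$ and $\underline{X}(m)$ given by

$$\overline{X}(m) = X(\overline{t}(m)) \quad \text{and} \quad \underline{X}(m) = X(\underline{t}(m)), \quad m = 1, 2, \ldots \tag{5.1}$$

are thus $\mathbb{F}_{\overline{t}(m)}$- and $\mathbb{F}_{\underline{t}(m)}$-measurable, respectively.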
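The construction of the two embedded sequences can be made concrete on a recorded trajectory. The following sketch assumes the trajectory and the policy indicator are available as plain lists; the function names and the sample data are illustrative only.

```python
def split_by_policy(states, eta):
    """Split a trajectory into the subsequences seen under each policy.

    states[t] is X(t); eta[t] is 1 if the upper policy is used in slot t and
    0 otherwise. Returns (upper_times, upper_states, lower_times, lower_states).
    """
    upper_times = [t for t, e in enumerate(eta) if e == 1]
    lower_times = [t for t, e in enumerate(eta) if e == 0]
    upper_states = [states[t] for t in upper_times]
    lower_states = [states[t] for t in lower_times]
    return upper_times, upper_states, lower_times, lower_states


def glued_consistently(states, times):
    """Check that the state one slot after each use time equals the state
    at the next use time, which is the content of the set equality (5.3)."""
    return all(states[t + 1] == states[t_next]
               for t, t_next in zip(times, times[1:]))


# Illustrative trajectory with reference state z = 0; the policy changes
# only at visits to 0 (slots 2 and 5).
states = [0, 1, 0, 2, 1, 0, 1]
eta = [1, 1, 0, 0, 0, 1, 1]
upper_times, upper_states, lower_times, lower_states = split_by_policy(states, eta)
```

Because switches happen only at visits to the reference state, the pieces recorded under one policy join end to end as if that policy had been used throughout.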

Lemma 5.1 Assume the recurrence condition (A1) to hold. Under $P^\alpha$, the RV’s $\{\overline{X}(m)\}_1^\infty$ (resp.
$\{\underline{X}(m)\}_1^\infty$) form a time-homogeneous Markov chain with one-step transition probability matrix $P(\overline{g})$
(resp. $P(\underline{g})$).

Proof. The result will be established for the sequence $\{\overline{X}(m)\}_1^\infty$, provided the equality

$$P^\alpha[X(\overline{t}(m+1)) = y \mid \mathbb{F}_{\overline{t}(m)}] = p_{X(\overline{t}(m))y}(\overline{g}), \quad m = 1, 2, \ldots \tag{5.2}$$

can be shown to hold. In fact, it suffices to show the set equality

$$[X(\overline{t}(m+1)) = y] = [X(\overline{t}(m) + 1) = y], \quad m = 1, 2, \ldots \tag{5.3}$$

since then

$$P^\alpha[X(\overline{t}(m+1)) = y \mid \mathbb{F}_{\overline{t}(m)}] = P^\alpha[X(\overline{t}(m) + 1) = y \mid \mathbb{F}_{\overline{t}(m)}], \quad m = 1, 2, \ldots \tag{5.4}$$

by the very definition of $\alpha$ and of the stopping time $\overline{t}(m)$, and the strong Markov property now
readily yields (5.2) from (5.4).

The proof of (5.3) is now given and considers two cases. If $y \ne z$, then necessarily $\overline{t}(m+1) =
\overline{t}(m) + 1$ and (5.3) is trivially true. If on the other hand $y = z$, the set equality (5.3) (with $y = z$)
is seen to hold by the following observations: On the event $[X(\overline{t}(m+1)) = z]$, it is not possible
that $X(\overline{t}(m) + 1) \ne z$, for this would imply $\overline{t}(m+1) = \overline{t}(m) + 1$ by the very definition of $\alpha$, thus
leading to the contradiction $X(\overline{t}(m+1)) \ne z$. Consequently,

$$[X(\overline{t}(m+1)) = z] \subset [X(\overline{t}(m) + 1) = z], \quad m = 1, 2, \ldots \tag{5.5}$$

Conversely, on the event $[X(\overline{t}(m) + 1) = z]$, the epoch $\overline{t}(m) + 1$ corresponds to the end of a cycle and
the next time the policy $\overline{g}$ is used necessarily marks the beginning of a cycle, so that $X(\overline{t}(m+1)) = z$
and

$$[X(\overline{t}(m) + 1) = z] \subset [X(\overline{t}(m+1)) = z], \quad m = 1, 2, \ldots \tag{5.6}$$

The result (5.3) is now obtained by combining (5.5) and (5.6).

These arguments apply mutatis mutandis to the sequence $\{\underline{X}(m)\}_1^\infty$. Details are left to the
interested reader. □

5.2.  The  key  convergence  results 

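For any mapping $c : S \to \mathbb{R}$, in order to study the convergence of $\{J^c(\tau(k))\}_1^\infty$ as $k$, the
number of cycles, goes to $\infty$, let $\overline{T}(l)$ (resp. $\underline{T}(l)$) denote the length of the $l$th cycle during which
the policy $\overline{g}$ (resp. $\underline{g}$) is used, and set

$$\overline{\tau}(l) := \sum_{s=1}^{l} \overline{T}(s) \quad \text{and} \quad \underline{\tau}(l) := \sum_{s=1}^{l} \underline{T}(s), \quad l = 1, 2, \ldots \tag{5.7}$$

In words, $\overline{\tau}(l)$ (resp. $\underline{\tau}(l)$) represents the total number of slots in the $l$ first cycles during which $\overline{g}$
(resp. $\underline{g}$) is used. Moreover, let the RV’s $\overline{Z}^c(l)$ and $\underline{Z}^c(l)$ defined by

$$\overline{Z}^c(l) := \sum_{m=\overline{\tau}(l-1)+1}^{\overline{\tau}(l)} c(\overline{X}(m)) \quad \text{and} \quad \underline{Z}^c(l) := \sum_{m=\underline{\tau}(l-1)+1}^{\underline{\tau}(l)} c(\underline{X}(m)), \quad l = 1, 2, \ldots \tag{5.8}$$

represent the total costs over the $l$th cycle during which the policies $\overline{g}$ and $\underline{g}$ are used, respectively.
In the definition of (5.8), it is convenient to set $\overline{Z}^c(l) = 0$ (resp. $\underline{Z}^c(l) = 0$) if $\overline{\tau}(l-1) = \infty$ (resp.
$\underline{\tau}(l-1) = \infty$). Thus, under Lemma 3.4, the quantities $\overline{Z}^c(l)$ and $\underline{Z}^c(l)$ are $P^\alpha$-a.s. well defined
and finite for all $l = 1, 2, \ldots$.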

The  next  lemma  is  an  immediate  consequence  of  Lemma  5.1. 

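Lemma 5.2 Assume the recurrence condition (A1) to hold. For any mapping $c : S \to \mathbb{R}$, the RV’s
$\{\overline{Z}^c(l)\}_1^\infty$ (resp. $\{\underline{Z}^c(l)\}_1^\infty$) form a (possibly delayed) renewal sequence under $P^\alpha$. Moreover, if the
mapping $c$ satisfies the condition (A3bis), then

$$E^\alpha[\overline{Z}^c(l)] = E_z^{\overline{g}}[Z^c] \quad \text{and} \quad E^\alpha[\underline{Z}^c(l)] = E_z^{\underline{g}}[Z^c], \quad l = 2, 3, \ldots \tag{5.9}$$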

Let the RV’s $\overline{\nu}(k)$ and $\underline{\nu}(k)$ count the total number of cycles in the first $k$ cycles that $\overline{g}$ and $\underline{g}$
are used, respectively. It is now plain that

$$\sum_{t=0}^{\tau(k)-1} c(X(t)) = \sum_{l=1}^{\overline{\nu}(k)} \overline{Z}^c(l) + \sum_{l=1}^{\underline{\nu}(k)} \underline{Z}^c(l) \qquad P^\alpha\text{-a.s.} \tag{5.10a}$$

for each $k = 1, 2, \ldots$, and with $c(x) = 1$ for all $x$ in $S$, this last relation specializes to

$$\tau(k) = \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) + \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l). \qquad P^\alpha\text{-a.s.} \tag{5.10b}$$

By virtue of Lemma 3.4, $\lim_k \overline{\nu}(k) = \lim_k \underline{\nu}(k) = \infty$ so that the next lemma is now immediate from
Lemma 5.2 and the Strong Law of Large Numbers.

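Lemma 5.3 Assume (A1)-(A4) to hold. For any mapping $c : S \to \mathbb{R}$ satisfying (A3bis), the
convergences

$$\lim_k \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{Z}^c(l) = E_z^{\overline{g}}[Z^c] \quad \text{and} \quad \lim_k \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{Z}^c(l) = E_z^{\underline{g}}[Z^c] \qquad P^\alpha\text{-a.s.} \tag{5.11a}$$

take place, and in particular,

$$\lim_k \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) = E_z^{\overline{g}}[T] \quad \text{and} \quad \lim_k \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l) = E_z^{\underline{g}}[T]. \qquad P^\alpha\text{-a.s.} \tag{5.11b}$$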


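Set

$$q(k) := \frac{\overline{\nu}(k)}{k}, \quad k = 1, 2, \ldots \tag{5.12}$$

and note that $\underline{\nu}(k)/k = 1 - q(k)$. The relations (5.10) imply that

$$\frac{\tau(k)}{k} = q(k) \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) + (1 - q(k)) \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l), \qquad P^\alpha\text{-a.s.} \tag{5.13a}$$

$$p(\tau(k)) = \frac{q(k) \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l)}{q(k) \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) + (1 - q(k)) \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l)} \qquad P^\alpha\text{-a.s.} \tag{5.13b}$$

and

$$J^c(\tau(k)) = \frac{q(k) \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{Z}^c(l) + (1 - q(k)) \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{Z}^c(l)}{q(k) \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) + (1 - q(k)) \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l)} \qquad P^\alpha\text{-a.s.} \tag{5.13c}$$

for all $k = 1, 2, \ldots$, where the convention $\frac{0}{0} = 0$ is used.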
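The bookkeeping behind (5.10) and (5.13c) can be checked on a small data set: the sample average cost over the first $k$ cycles equals a ratio of per-policy cycle averages weighted by the fraction of cycles each policy was used. The sketch below uses exact rational arithmetic; the function name, the tuple layout and the sample cycles are illustrative assumptions.

```python
from fractions import Fraction


def averages_from_cycles(cycles):
    """Evaluate q(k) and the ratio on the right-hand side of (5.13c).

    cycles is a list of (policy, length, cost) with policy equal to
    'upper' or 'lower', and integer lengths and costs.
    """
    k = len(cycles)
    upper = [(T, Z) for p, T, Z in cycles if p == "upper"]
    lower = [(T, Z) for p, T, Z in cycles if p == "lower"]

    def mean(values):
        # Convention 0/0 = 0 for a policy that has not been used yet.
        return Fraction(sum(values), len(values)) if values else Fraction(0)

    q = Fraction(len(upper), k)
    num = q * mean([Z for _, Z in upper]) + (1 - q) * mean([Z for _, Z in lower])
    den = q * mean([T for T, _ in upper]) + (1 - q) * mean([T for T, _ in lower])
    return q, num / den


# Five illustrative cycles: total length 15 slots, total cost 14.
cycles = [("upper", 3, 4), ("lower", 2, 1), ("upper", 5, 7),
          ("lower", 4, 2), ("lower", 1, 0)]
q_k, ratio = averages_from_cycles(cycles)
direct = Fraction(sum(Z for _, _, Z in cycles), sum(T for _, T, _ in cycles))
```

Here `ratio` and `direct` agree exactly, which is the identity the convergence argument relies on.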

For any mapping $c : S \to \mathbb{R}$ satisfying (A3bis) with the quantity $E_z^{\overline{g}}[Z^c] + E_z^{\underline{g}}[Z^c]$ being
well defined (but possibly infinite), it is now plain from Lemma 5.3 and (5.13) that under $P^\alpha$,
the sequences of RV’s $\{\tau(k)/k\}_1^\infty$, $\{p(\tau(k))\}_1^\infty$ and $\{J^c(\tau(k))\}_1^\infty$ converge a.s. if the RV’s $\{q(k)\}_1^\infty$
converge a.s. This key convergence result for the RV’s $\{q(k)\}_1^\infty$ is taken up in the next proposition,
whose proof is delayed until the next section.

Theorem 5.4 Under (A1)-(A4), the RV’s $\{q(k)\}_1^\infty$ converge $P^\alpha$-a.s. to the constant $q^*$ given by
(3.10), i.e.,

$$\lim_k q(k) = q^*. \qquad P^\alpha\text{-a.s.} \tag{5.14}$$


With  the  help  of  Theorem  5.4,  Theorem  3.5  can  now  be  proved  easily. 

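A proof of Theorem 3.5. From the remarks made earlier, it follows readily that under (5.14),
the $P^\alpha$-a.s. limits of the sequences of RV’s $\{\tau(k)/k\}_1^\infty$, $\{p(\tau(k))\}_1^\infty$ and $\{J^c(\tau(k))\}_1^\infty$ are necessarily
given by

$$\lim_k \frac{\tau(k)}{k} = q^* E_z^{\overline{g}}[T] + (1 - q^*) E_z^{\underline{g}}[T], \qquad P^\alpha\text{-a.s.} \tag{5.15a}$$

$$\lim_k p(\tau(k)) = \frac{q^* E_z^{\overline{g}}[T]}{q^* E_z^{\overline{g}}[T] + (1 - q^*) E_z^{\underline{g}}[T]} \qquad P^\alpha\text{-a.s.} \tag{5.15b}$$

and

$$\lim_k J^c(\tau(k)) = \frac{q^* E_z^{\overline{g}}[Z^c] + (1 - q^*) E_z^{\underline{g}}[Z^c]}{q^* E_z^{\overline{g}}[T] + (1 - q^*) E_z^{\underline{g}}[T]}, \qquad P^\alpha\text{-a.s.} \tag{5.15c}$$

respectively. While (5.15a) gives (3.13), simple algebraic calculations based on (3.6) and (3.10)
easily yield (3.11)-(3.12) from (5.15b)-(5.15c). The proof of Theorem 3.5 is therefore complete. □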
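The algebra linking the fraction of cycles $q^*$ to the fraction of time spent under $\overline{g}$ can be checked numerically. The sketch below evaluates the right-hand sides of (5.15a)-(5.15c) and inverts (5.15b); the function names and the numerical cycle statistics are illustrative assumptions, with `T_up`, `T_low`, `Z_up`, `Z_low` standing for the mean cycle lengths and mean cycle costs under the two policies.

```python
def q_from_p(p, T_up, T_low):
    """Cycle fraction under the upper policy that yields time fraction p.

    Obtained by solving (5.15b) for q with the limit set equal to p.
    """
    return p * T_low / ((1.0 - p) * T_up + p * T_low)


def limits(q, T_up, T_low, Z_up, Z_low):
    """Right-hand sides of (5.15a)-(5.15c) for a given cycle fraction q."""
    mean_cycle = q * T_up + (1.0 - q) * T_low                   # (5.15a)
    time_fraction = q * T_up / mean_cycle                       # (5.15b)
    average_cost = (q * Z_up + (1.0 - q) * Z_low) / mean_cycle  # (5.15c)
    return mean_cycle, time_fraction, average_cost


# Illustrative numbers only.
p = 0.3
q = q_from_p(p, T_up=4.0, T_low=2.5)
mean_cycle, time_fraction, average_cost = limits(q, 4.0, 2.5, 6.0, 2.0)
```

With these numbers the time fraction returned by `limits` recovers `p`, and the average cost equals the `p`-weighted mix of the two per-policy ratios `Z_up / T_up` and `Z_low / T_low`.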

5.3.  A  proof  of  Theorem  5.4 

Crucial  to  the  proof  of  Theorem  5.4  is  the  following  deterministic  lemma. 

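Lemma 5.5 Let $\{a(k)\}_1^\infty$, $\{\overline{b}(k)\}_1^\infty$ and $\{\underline{b}(k)\}_1^\infty$ be $\mathbb{R}$-valued sequences satisfying the conditions

$$\overline{b}(k) \ge 0 \quad \text{and} \quad \underline{b}(k) \ge 0, \quad k = 1, 2, \ldots \tag{5.16a}$$

and

$$\lim_k \overline{b}(k) = 0, \quad \lim_k \underline{b}(k) = 0 \quad \text{and} \quad \lim_k a(k) = a \tag{5.16b}$$

for some $a$ in $\mathbb{R}$. If the $\mathbb{R}$-valued sequence $\{\theta(k)\}_1^\infty$ is defined recursively by

$$\theta(k+1) = \begin{cases} \theta(k) - \overline{b}(k) & \text{if } \theta(k) > a(k); \\ \theta(k) + \underline{b}(k) & \text{if } \theta(k) \le a(k), \end{cases} \quad k = 1, 2, \ldots \tag{5.17}$$

with $\theta(1)$ arbitrary in $\mathbb{R}$, then either $\{\theta(k)\}_1^\infty$ converges monotonically (in the tail) to some constant
$\theta(\infty) \ne a$, or $\lim_k \theta(k) = a$.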

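Proof. By assumption, given $\epsilon > 0$, there exists a positive integer $k_\epsilon$ such that $\overline{b}(k) < \epsilon$, $\underline{b}(k) < \epsilon$
and $|a(k) - a| < \epsilon$ for all $k \ge k_\epsilon$. Define

$$m_\epsilon := \inf\{k \ge k_\epsilon : \theta(k) \in (a - \epsilon, a + \epsilon)\}. \tag{5.18}$$

If $m_\epsilon = \infty$, then $\theta(k)$ is not in the interval $(a - \epsilon, a + \epsilon)$ for all $k \ge k_\epsilon$. If $\theta(k_\epsilon) \le a - \epsilon$, then
$\theta(k) \le a - \epsilon < a(k)$ for all $k \ge k_\epsilon$. To see this, recall that $\underline{b}(k) < \epsilon$ for all $k \ge k_\epsilon$, and from
(5.17) this implies $\theta(k_\epsilon + 1) < a$, whence $\theta(k_\epsilon + 1) \le a - \epsilon$ by the definition of $m_\epsilon$. An induction
argument now shows that $\theta(k) \le a - \epsilon$ for all $k \ge k_\epsilon$, so that by (5.17) the sequence $\{\theta(k)\}_1^\infty$ is
monotone increasing from time $k_\epsilon$ onward, and must converge to some value $\theta(\infty) \le a - \epsilon$. The
case $\theta(k_\epsilon) \ge a + \epsilon$ is similarly discussed.

Suppose $m_\epsilon < \infty$ so that $\theta(m_\epsilon)$ now lies in $(a - \epsilon, a + \epsilon)$. From (5.17) again, it follows that

$$a - \epsilon - \overline{b}(m_\epsilon) < \theta(m_\epsilon + 1) < a + \epsilon + \underline{b}(m_\epsilon). \tag{5.19}$$

If in (5.19), $a - \epsilon < \theta(m_\epsilon + 1) < a + \epsilon$, then the inequalities

$$a - \epsilon - \overline{b}(m_\epsilon + 1) < \theta(m_\epsilon + 2) < a + \epsilon + \underline{b}(m_\epsilon + 1) \tag{5.20a}$$

hold. On the other hand, if in (5.19), $\theta(m_\epsilon + 1) \notin (a - \epsilon, a + \epsilon)$, then two cases are possible: Either
(i) $a - \epsilon - \overline{b}(m_\epsilon) < \theta(m_\epsilon + 1) \le a - \epsilon$ in which case $\theta(m_\epsilon + 1) < a(m_\epsilon + 1)$ and therefore

$$a - \epsilon - \overline{b}(m_\epsilon) + \underline{b}(m_\epsilon + 1) < \theta(m_\epsilon + 2) \le a - \epsilon + \underline{b}(m_\epsilon + 1), \tag{5.20b}$$

by making use of (5.17) or (ii) $a + \epsilon \le \theta(m_\epsilon + 1) < a + \epsilon + \underline{b}(m_\epsilon)$ in which case $\theta(m_\epsilon + 1) > a(m_\epsilon + 1)$
and therefore

$$a + \epsilon - \overline{b}(m_\epsilon + 1) \le \theta(m_\epsilon + 2) < a + \epsilon + \underline{b}(m_\epsilon) - \overline{b}(m_\epsilon + 1). \tag{5.20c}$$

It follows easily from (5.20) that

$$a - \epsilon - \max\{\overline{b}(m_\epsilon), \overline{b}(m_\epsilon + 1)\} < \theta(m_\epsilon + 2) < a + \epsilon + \max\{\underline{b}(m_\epsilon), \underline{b}(m_\epsilon + 1)\}.$$

An induction argument now implies that the inequalities

$$a - \epsilon - \max_{0 \le i < l} \overline{b}(m_\epsilon + i) < \theta(m_\epsilon + l) < a + \epsilon + \max_{0 \le i < l} \underline{b}(m_\epsilon + i) \tag{5.21}$$

hold for all $l = 1, 2, \ldots$. Since $m_\epsilon \ge k_\epsilon$, the definition of $k_\epsilon$ yields

$$a - 2\epsilon < \theta(k) < a + 2\epsilon$$

for all $k \ge m_\epsilon$, and $\epsilon$ being arbitrary, the proof is now complete. □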
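Both alternatives in Lemma 5.5 can be observed numerically. The sketch below iterates the recursion (5.17) with two illustrative choices of step sizes, which are assumptions made for this example only: steps of order $1/k$, whose sum diverges, and steps of order $1/k^2$, whose sum is finite.

```python
def run_recursion(theta1, a_seq, dec, inc, steps):
    """Iterate (5.17): step down by dec(k) if theta > a_seq(k), else up by inc(k)."""
    theta = theta1
    for k in range(1, steps + 1):
        if theta > a_seq(k):
            theta -= dec(k)
        else:
            theta += inc(k)
    return theta


a = 0.5


def target(k):
    """Illustrative sequence a(k) converging to a."""
    return a + 1.0 / k


# Non-summable steps: the iterates are driven into a neighborhood of a.
tracked = run_recursion(2.0, target,
                        lambda k: 1.0 / (k + 1), lambda k: 2.0 / (k + 1), 20_000)

# Summable steps: the iterates decrease monotonically and stall away from a.
stalled = run_recursion(5.0, target,
                        lambda k: 1.0 / (k + 1) ** 2,
                        lambda k: 1.0 / (k + 1) ** 2, 20_000)
```

In the second run the total possible decrease is bounded, so the iterates never reach the target and converge monotonically to a different limit, which is the first alternative of the lemma.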

A  proof  of  Theorem  5.4  is  now  presented. 

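A proof of Theorem 5.4. Define the RV’s $\{Y(k)\}_1^\infty$ by

$$Y(k) := \frac{\frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l) \, V - \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{Z}^d(l)}{\left( \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{Z}^d(l) - \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{Z}^d(l) \right) - \left( \frac{1}{\overline{\nu}(k)} \sum_{l=1}^{\overline{\nu}(k)} \overline{T}(l) - \frac{1}{\underline{\nu}(k)} \sum_{l=1}^{\underline{\nu}(k)} \underline{T}(l) \right) V}, \quad k = 1, 2, \ldots \tag{5.22}$$

and observe from (5.13c) (with $d$ replacing $c$) that $J^d(\tau(k)) > V$ if and only if $q(k) > Y(k)$. The
definition of $\alpha$ implies that the RV’s $\{q(k)\}_1^\infty$ are defined recursively by

$$q(k+1) = \begin{cases} q(k) - \frac{1}{k+1} q(k) & \text{if } q(k) > Y(k); \\ q(k) + \frac{1}{k+1} (1 - q(k)) & \text{if } q(k) \le Y(k), \end{cases} \quad k = 1, 2, \ldots \tag{5.23}$$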
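The recursion (5.23) is simply the update of a running fraction: after $k$ cycles, $q(k)$ is the fraction of cycles run under $\overline{g}$, and the next cycle either adds to that count or not. The sketch below checks this equivalence with exact rational arithmetic, using a constant threshold in place of $Y(k)$; the function names and the constant threshold are illustrative assumptions.

```python
from fractions import Fraction


def next_q(q, k, Y):
    """One step of (5.23): q(k+1) from q(k) and the threshold Y(k)."""
    if q > Y:
        return q - q / (k + 1)        # lower policy used in cycle k + 1
    return q + (1 - q) / (k + 1)      # upper policy used in cycle k + 1


def q_by_counting(upper_count, k, Y):
    """Same update, written as a count of cycles under the upper policy."""
    q = Fraction(upper_count, k)
    if q <= Y:
        upper_count += 1
    return Fraction(upper_count, k + 1)


# Run both forms side by side with an illustrative constant threshold.
Y = Fraction(3, 10)
q, count = Fraction(1), 1             # first cycle run under the upper policy
agree = True
for k in range(1, 200):
    q_new = next_q(q, k, Y)
    agree = agree and (q_new == q_by_counting(count, k, Y))
    count = int(q_new * (k + 1))      # recover the integer cycle count
    q = q_new
```

With a constant threshold the fraction settles within roughly one step size of the threshold, which matches the deterministic tracking behavior used in the proof.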


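Under (A1)-(A4), it follows from Lemma 5.3 that

$$\lim_k Y(k) = \frac{E_z^{\underline{g}}[T] V - E_z^{\underline{g}}[Z^d]}{(E_z^{\overline{g}}[Z^d] - E_z^{\underline{g}}[Z^d]) - (E_z^{\overline{g}}[T] - E_z^{\underline{g}}[T]) V} \qquad P^\alpha\text{-a.s.} \tag{5.24}$$

so that

$$\lim_k Y(k) = \frac{p^* E_z^{\underline{g}}[T]}{(1 - p^*) E_z^{\overline{g}}[T] + p^* E_z^{\underline{g}}[T]} \qquad P^\alpha\text{-a.s.} \tag{5.25}$$

by simple algebraic manipulations based on (A4), (2.7), (3.2) and (3.10).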


Pick a sample $\omega$ not in the $P^\alpha$-null set on which (5.25) fails and set $\theta(k) = q(k, \omega)$, $a(k) =
Y(k, \omega)$, $\overline{b}(k) = \frac{1}{k+1} q(k, \omega)$ and $\underline{b}(k) = \frac{1}{k+1} (1 - q(k, \omega))$ for all $k = 1, 2, \ldots$, and note that $a = q^*$. Since
$0 \le q(k, \omega) \le 1$, the assumptions of Lemma 5.5 are immediately satisfied, and the a.s. convergence
of the RV’s $\{q(k)\}_1^\infty$ follows. It is not possible for the values $\{q(k, \omega)\}_1^\infty$ to converge monotonically
(in the tail) to some value not equal to $q^*$, for this would imply that the policy $\alpha$ sticks to one
policy from some cycle onward, in clear contradiction with Lemma 3.4. □